
Add optional file-based listings caching #895

Open

gutzbenj wants to merge 2 commits into base: master from add-local-listings-cache
Conversation

gutzbenj
Contributor

@gutzbenj gutzbenj commented Jan 25, 2022

Dear @martindurant and fsspec team,

this PR adds an optional file-based listings cache with the help of the beautiful diskcache [1]. This is relevant because the listings cache docs mention that creating certain large listings can be really expensive.

The following changes are made:

  • rename the memdircache.py module to dircache.py to reflect general dir caching (listings caching), and add an argument to choose between the two (MemDirCache is the default)
  • rename the DirCache class to MemDirCache
  • introduce a new FileDirCache that implements all abstract methods via the methods/attributes of the diskcache.Cache class [2] and places the cache file at a configurable location
  • parametrize the listings cache tests to cover both MemDirCache and FileDirCache
  • adjust the relevant documentation

Please take a look and let me know about any required changes!

Cheers
Benjamin

[1] http://www.grantjenks.com/docs/diskcache/
[2] http://www.grantjenks.com/docs/diskcache/api.html#diskcache.Cache

#787

CC @amotl

@amotl
Contributor

amotl commented Jan 28, 2022

Dear Benjamin,

I want to thank you very much for tackling this improvement to fsspec. In order to shed more light on the background of this, I would like to reference earthobservations/wetterdienst#243, where we are coming from.

This improvement solves a longstanding issue for us, as the dbm-file-based backend of dogpile.cache proved too brittle when operating the Wetterdienst library under concurrent workloads. I used to work with both Beaker and dogpile.cache in the past, and they really shine in environments where you can easily afford to run a proper database-backed cache, like Memcached, Redis, or even MongoDB.

However, this is not the situation with Wetterdienst: It is mostly library- and commandline-driven, so we didn't want to impose the overhead of running a caching database on its users.

As discussed at earthobservations/wetterdienst#243 (comment), I was very happy to learn about DiskCache by @grantjenks (thank you!) from you, as its features promised it to be the absolute right thing to use in a library context.

On my ramblings

It might even be suitable to accompany the fsspec-based caching implementation fsspec/caching.py.

@martindurant answered:

Certainly, fsspec would be happy to see alternative, more fully-featured caching schemes. Integration should generally be pretty simple.

In this manner, thank you again for making a start with this patch. I hope the integration will go along well.

With kind regards,
Andreas.


Further references

I discovered those resources in the issue tracker, which also indicate that the current fsspec/caching.py might have shortcomings in certain situations. Maybe those routines can also be improved by bringing DiskCache to the table?

@nbren12

nbren12 commented Jan 28, 2022

Should this feature be disabled by default, or is it already? I couldn't tell from the code what the default behavior is. I could see it being pretty confusing if not, e.g. for an application that monitors for changes in a bucket using fs.ls. Users are always free to wrap operations in their own caching layer, no? E.g.

from toolz import memoize
# in memory (note: fsspec has no module-level ls; wrap the instance method fs.ls)
ls = memoize(fs.ls)
# on disk
from joblib import Memory
cache = Memory(".cache")
ls = cache.cache(fs.ls)

@martindurant
Member

The dircache is used for more operations than simply caching the output of ls, so I don't think the simple approach above does everything you want.

@nbren12

nbren12 commented Jan 28, 2022

I'm mainly just cautioning against caching by default...is it a default?

@martindurant
Member

Yes, we cache file listings for remote filesystems, where getting such listings can be pretty expensive.

@nbren12

nbren12 commented Jan 28, 2022

So this is the expected default behavior:

>>> fs.ls("gs://some-bucket")
[]
>>> fs.put("local_file", "gs://some-bucket")
>>> fs.ls("gs://some-bucket")
[]

I could imagine losing hours debugging something like that.

@martindurant
Member

put is supposed to explicitly invalidate the parts of the cache it touches, so this is a bug to be fixed.

@nbren12

nbren12 commented Jan 28, 2022

I'm not sure it is a bug. I didn't run the code above...I just meant it as an illustrative example. Suppose instead of fsspec.put someone used another tool or that they want to monitor changes made by other machines. The cache can easily become out of sync with reality.

@martindurant
Member

Agreed, the cache can get out of sync when processes are happening elsewhere, so we have the ability to force refresh, set an expiry time or turn the listings cache off. However, the default is to cache for the duration of the python session, which seems (to me) to align with the most common use cases.
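
The knobs mentioned above can be sketched as follows. This is a hedged example: the keyword names (use_listings_cache, listings_expiry_time) and invalidate_cache come from the fsspec listings-cache documentation, and the memory:// filesystem is used here only as a cheap stand-in for a remote store:

```python
import fsspec

# In-memory filesystem as a stand-in for a slow remote store
fs = fsspec.filesystem("memory", skip_instance_cache=True)
fs.mkdirs("/bucket", exist_ok=True)
fs.pipe("/bucket/a.txt", b"data")
fs.ls("/bucket", detail=False)  # listing is now cached in fs.dircache

# Force a refresh by dropping cached listings for one path
# (or all of them, by passing no argument)
fs.invalidate_cache("/bucket")

# Turn listings caching off entirely at construction time
fs_nocache = fsspec.filesystem(
    "memory", use_listings_cache=False, skip_instance_cache=True
)

# Or keep the cache but expire entries after 30 seconds
fs_expiring = fsspec.filesystem(
    "memory", listings_expiry_time=30, skip_instance_cache=True
)
```

The default (cache for the duration of the Python session) corresponds to passing none of these options.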

ci/environment-win.yml (outdated review thread, resolved)
@gutzbenj gutzbenj force-pushed the add-local-listings-cache branch 2 times, most recently from a151320 to 17e926e Compare October 9, 2023 22:03
@gutzbenj
Contributor Author

gutzbenj commented Oct 9, 2023

Dear all,

finally updated the PR. Could you please give me a review?

@amotl
Contributor

amotl commented Oct 10, 2023

Hi Benjamin and Martin,

there is another discussion over at #1127, which I wanted to make you aware of. It may happen that this patch gets rejected, or is not merged in its current form.

fsspec is very careful not to require any dependencies; DiskCache should only be optional.

If @martindurant agrees with my evaluation that it is indeed unlikely to get merged, I am thinking about shipping it as a separate utility package, in order to make the recipe reusable beyond the Wetterdienst package, where it is currently used.

What do you think about this proposal?

With kind regards,
Andreas.

skip the caching altogether, use ``use_listings_cache=False``. That would be appropriate
when the target location is known to be volatile because it is being written to from other
sources. If you want to use the file-based caching, you can also provide the argument
``listings_cache_location`` to determine where the cache file is stored.
Member

Do you think there's anything that can be done at this stage to simplify all these parameters? Should we have a caching_options dict, or config options (or both) ?

Contributor Author

@gutzbenj gutzbenj Mar 18, 2024

Dear @martindurant ,

I updated the configuration and put it all into one dict named listings_cache_options to make clear what it refers to.

Options for [data] caching and listings caching should probably not be put together, both to keep things simple and because keywords such as "expiry_time" are similarly named.

fsspec/dircache.py (outdated review threads, resolved)
listings_expiry_time = listings_expiry_time and float(listings_expiry_time)

if listings_cache_location:
listings_cache_location = Path(listings_cache_location) / str(
Member

Why would we use Path within a filesystem library :)

Member

Can diskcache handle fsspec paths? E.g., memory://cache would be a very similar solution to MemDirCache. Such an option might even be useful for semi-speedy remotes like smb:

fsspec/dircache.py (outdated review thread, resolved)
@@ -103,6 +103,8 @@ def __init__(
request_options.pop("listings_expiry_time", None)
request_options.pop("max_paths", None)
request_options.pop("skip_instance_cache", None)
request_options.pop("listings_cache_type", None)
Member

This is problematic - where else might these kwargs pollute calls? Do we need to change the signatures of all backends?

fsspec/implementations/tests/test_http.py (review thread resolved)
fsspec/spec.py (outdated review thread, resolved)
tox.ini (outdated review thread, resolved)
@martindurant
Member

Given progress here, I think we can fix this up, and see whether it garners any interest.

fsspec/asyn.py Outdated
@@ -297,15 +297,15 @@ class AsyncFileSystem(AbstractFileSystem):
mirror_sync_methods = True
disable_throttling = False

def __init__(self, *args, asynchronous=False, loop=None, batch_size=None, **kwargs):
def __init__(self, *args, asynchronous=False, loop=None, batch_size=None, listings_cache_options=None, **kwargs):
Contributor Author

Should we name it dir_cache_options instead, to be consistent with MemoryDirCache/FileDirCache?

fsspec/spec.py Outdated
Comment on lines 150 to 164
if not listings_cache_options:
    listings_cache_options = {}
elif listings_cache_options is True:
    listings_cache_options = {"cache_type": CacheType.MEMORY, "expiry_time": None}
else:
    listings_cache_options = {"expiry_time": None} | listings_cache_options
cache_type = listings_cache_options.get("cache_type")
if cache_type:
    if isinstance(cache_type, str):
        try:
            cache_type = CacheType[cache_type.upper()]
        except KeyError as e:
            raise ValueError(
                f"Cache type must be one of {', '.join(ct.name.lower() for ct in CacheType)}"
            ) from e
    listings_cache_options["cache_type"] = cache_type
Contributor Author

I think there's too much going on here!

Member

I think it's fine! It could be pulled out of here into a norm_cache_options util function.
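
Extracted as suggested, such a helper might look like the sketch below. The function name norm_cache_options follows the suggestion above, and the CacheType enum members are assumed from the quoted diff; none of this is the PR's final code:

```python
from enum import Enum


class CacheType(Enum):
    # Assumed members, mirroring MemDirCache / FileDirCache in this PR
    MEMORY = "memory"
    FILE = "file"


def norm_cache_options(listings_cache_options):
    """Normalise the user-supplied value into a plain options dict.

    Accepts None/False (no options), True (memory cache, no expiry),
    or a dict whose optional "cache_type" may be a string or CacheType.
    """
    if listings_cache_options is True:
        return {"cache_type": CacheType.MEMORY, "expiry_time": None}
    if not listings_cache_options:
        return {}
    # Fill in the default expiry without mutating the caller's dict
    listings_cache_options = {"expiry_time": None, **listings_cache_options}
    cache_type = listings_cache_options.get("cache_type")
    if isinstance(cache_type, str):
        try:
            listings_cache_options["cache_type"] = CacheType[cache_type.upper()]
        except KeyError as e:
            valid = ", ".join(ct.name.lower() for ct in CacheType)
            raise ValueError(f"Cache type must be one of {valid}") from e
    return listings_cache_options
```

Keeping the normalisation in one place means every backend's __init__ can stay a one-liner.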

@martindurant
Member

>           self.dircache[path.rstrip("/")] = files
E           TypeError: 'NoneType' object does not support item assignment

It seems like the "no cache" case should still have a no-op dict to cope with this kind of thing. So there would be a third cache type, one that corresponds to the use_listings_cache=False behaviour from before.
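
Such a no-op third cache type could be a MutableMapping that accepts writes but stores nothing, so filesystem code can assign into dircache unconditionally. A sketch; the class name DisabledListingsCache is hypothetical:

```python
from collections.abc import MutableMapping


class DisabledListingsCache(MutableMapping):
    """No-op cache: item assignment is accepted but nothing is ever stored."""

    def __setitem__(self, key, value):
        pass  # silently discard, so callers never hit a TypeError

    def __getitem__(self, key):
        raise KeyError(key)  # every lookup is a cache miss

    def __delitem__(self, key):
        pass  # invalidation of a never-stored entry must not raise

    def __iter__(self):
        return iter(())

    def __len__(self):
        return 0
```

With this in place, the assignment in the traceback above succeeds harmlessly instead of failing on a None dircache.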

@martindurant
Member

@amotl @gutzbenj : this PR seems to have stalled. I know there's quite a thread above, but what do you think we can do from here to get this useful functionality in?

@gutzbenj
Contributor Author

Dear @martindurant ,

thanks for the reminder. I just need a bit of time. I will pick this up again in about two weeks and push it actively.

@gutzbenj gutzbenj force-pushed the add-local-listings-cache branch 21 times, most recently from 029d267 to eb42007 Compare May 10, 2024 16:40
@gutzbenj
Contributor Author

Dear @martindurant ,

still on it...

Currently some tests are failing which I believe are not caused by my changes. Maybe some real-life backend has changed or similar.

@martindurant
Member

This seems like it might be real:

FAILED fsspec/implementations/tests/test_http.py::test_list_cache_memory[None] - assert (0.002689361572265625 / 0.0018138885498046875) > 1.5

Could it be that the file cache is causing a significant slow-down? Perhaps we just need to alter the test a bit to allow for a smaller speedup factor, like 1.1.
